NVIDIA has introduced NVLM 1.0, a family of advanced multimodal large language models (LLMs) that excel at vision-language tasks, competing with proprietary models such as GPT-4o as well as open-access models such as Llama 3-V 405B and InternVL 2. As part of this release, the decoder-only NVLM-D-72B model has been open-sourced for community use. Notably, NVLM 1.0 shows improved performance on text-only tasks over its underlying LLM backbone after multimodal training.

The model was trained with the Megatron-LM framework and adapted for hosting and inference on Hugging Face, which makes the results reproducible and easy to compare with other models. Benchmark results show that NVLM-D 1.0 72B is competitive with other leading models on vision-language benchmarks such as MMMU, MathVista, and VQAv2, and it also performs well on text-only benchmarks, underscoring its versatility.

The architecture supports efficient loading and inference, including multi-GPU setups. Instructions cover preparing the environment, loading the model, and running inference, so users can apply the model to their own applications. Inference spans both pure-text conversations and image-based interactions: users can hold text-only dialogues or ask the model to describe images. The documentation includes detailed code snippets for loading images, preprocessing them, and interacting with the model; illustrative sketches of these steps are given below.

The NVLM project is a collaborative effort by multiple researchers at NVIDIA. The model is released under the Creative Commons BY-NC 4.0 license, permitting non-commercial use. The introduction of NVLM 1.0 marks a significant advancement in multimodal AI, providing powerful tools for developers and researchers alike.
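The original loading code is not reproduced here, but a minimal sketch of multi-GPU loading with the Hugging Face `transformers` library might look like the following. The repository ID `nvidia/NVLM-D-72B`, the `trust_remote_code` setting, and the use of `device_map="auto"` to shard the weights are assumptions based on how large remote-code multimodal checkpoints are commonly loaded, not a transcription of the official instructions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed Hugging Face repository ID for the open-sourced decoder-only model.
MODEL_ID = "nvidia/NVLM-D-72B"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,  # the custom NVLM-D model code ships with the checkpoint
    use_fast=False,
)

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # half the fp32 memory footprint for the 72B weights
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",           # requires accelerate; shards layers across visible GPUs
).eval()
```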
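Continuing from the loading sketch above (it reuses `model` and `tokenizer`), the snippet below illustrates the kind of image preprocessing and text/image interaction the documentation describes. The 448x448 input size, the ImageNet normalization constants, the `<image>` placeholder, and the `model.chat(...)` signature are assumptions modeled on similar open vision-language checkpoints; the snippets on the official model card are authoritative.

```python
import torch
from PIL import Image
from torchvision import transforms as T

# ImageNet statistics; many ViT-based vision encoders expect this normalization.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def load_image(path: str, input_size: int = 448) -> torch.Tensor:
    """Load an image, resize it to the assumed encoder resolution, and normalize."""
    transform = T.Compose([
        T.Resize((input_size, input_size), interpolation=T.InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ])
    image = Image.open(path).convert("RGB")
    return transform(image).unsqueeze(0)  # shape (1, 3, input_size, input_size)

generation_config = dict(max_new_tokens=512, do_sample=False)

# Pure-text dialogue: no pixel values are passed.
reply = model.chat(tokenizer, None, "Summarize what NVLM 1.0 is.", generation_config)
print(reply)

# Image-based interaction: the <image> placeholder marks where visual tokens are inserted.
pixel_values = load_image("example.jpg").to(torch.bfloat16).cuda()
question = "<image>\nPlease describe this image in detail."
reply = model.chat(tokenizer, pixel_values, question, generation_config)
print(reply)
```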
Wednesday, October 2, 2024